Case Study: How Does a Bike-Share Navigate Speedy Success?
Contents
Case Study: How Does a Bike-Share Navigate Speedy Success?#
Guiding Questions of Analysis?#
How do annual members and casual riders use Cyclistic bikes differently?
Data Gathering#
The data gathered for this analysis is sourced from the publicly open dataset which is provided by the Company Cyclistic under the following Data Licence Agreement.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as ptly_go
import plotly.offline as ptly_off
import matplotlib.pyplot as plt
import matplotlib.ticker as mat_tick
from pathlib import Path
import math
import seaborn as sns
import locale
import datetime
from IPython.display import Markdown as md
from glob import glob
import heapq
import plotly.io as pio
pio.renderers.default = "notebook"
ptly_off.init_notebook_mode()
locale.setlocale(locale.LC_NUMERIC, 'de_DE')
locale.setlocale(locale.LC_TIME, 'en_US')
%matplotlib inline
# IMPORTING DATA
csv_list = glob('*-tripdata.csv', root_dir="./../raw_data/")
dfs = [
pd.read_csv(f'./../raw_data/{filename}', index_col = None, parse_dates = ['started_at','ended_at'],
dtype={'rideable_type': 'category', 'casual_member': 'category'})
for filename in csv_list
]
df = pd.concat(dfs)
Data Cleaning#
Start by Cleaning the data where there are any N/A values as well as rides that start and end at the exact coordinate values.
# DROP NaN RIDES
# DROP ROWS IF STATIONS = NaN AND START + END LAT/LONG VALUES ARE SAME
df_cleaned = df.dropna(axis=0, how='any',
subset=['start_station_name','start_station_id', 'end_station_name', 'end_station_id'],
inplace = False)
df_cleaned = df_cleaned.loc[(df_cleaned['start_lat'] != df_cleaned['end_lat']) &
(df_cleaned['start_lng'] != df_cleaned['end_lng'])]
md(f"""Rows drecreased by **{locale.format_string('%d', df.shape[0]-df_cleaned.shape[0], grouping=True)}**
from {locale.format_string('%d', df.shape[0], grouping = True)} to {locale.format_string('%d', df_cleaned.shape[0], grouping=True)}.""")
Rows drecreased by 1.507.429 from 5.901.463 to 4.394.034.
md(f"""Check for matching observations (start and end stations should have the same amount of ID's and names)
| Station Type | # of ID's | # of names | DIFF |
| --- | --- | --- | --- |
| Start | {df_cleaned['start_station_id'].nunique()} | {df_cleaned['start_station_name'].nunique()} | {abs(df_cleaned['start_station_id'].nunique() - df_cleaned['start_station_name'].nunique())} |
| End | {df_cleaned['end_station_id'].nunique()} | {df_cleaned['end_station_name'].nunique()} | {abs(df_cleaned['end_station_id'].nunique() - df_cleaned['end_station_name'].nunique())} |
""")
Check for matching observations (start and end stations should have the same amount of ID's and names)
| Station Type | # of ID's | # of names | DIFF | | --- | --- | --- | --- | | Start | 1130 | 1253 | 123 | | End | 1158 | 1286 | 128 |
Since there is a clear difference in the numnber of names to stations there seems to be some kind of irregulatory, probably typos. Sincec we do not have an easy way to filter them out lets create a station dataframe with 1:1 id to name and only take the names with the largest number of appearances. afterwards we will replace the names in the cleaned dataframe with the stations that we found to be the most common names.
# create single df for stations (no difference for start or end)
station_concat = pd.concat([df_cleaned[['start_station_id','start_station_name']]\
.rename(columns={'start_station_id':'station_id', 'start_station_name': 'station_name'}),
df_cleaned[['end_station_id','end_station_name']]\
.rename(columns={'end_station_id':'station_id', 'end_station_name': 'station_name'})])
# groupby stations and count station name
stations_df = station_concat.groupby(['station_id','station_name'])['station_name'].count().rename('count')
# use only speeling with occures most often
station_list = []
for idx, temp_df in stations_df.groupby(level=0):
station_list.append(temp_df.nlargest(1))
stations_df_cleaned = pd.concat(station_list).to_frame()
stations_df_cleaned = stations_df_cleaned.reset_index(1).drop(columns='count')
stations_df_cleaned
stations_dict = {}
for idx, value in stations_df_cleaned.groupby(level=0):
stations_dict.update({idx: value['station_name'].item()})
# change names in df_cleaned
df_cleaned['start_station_name'] = df_cleaned['start_station_id'].apply(lambda x:
stations_dict.get(x))
df_cleaned['end_station_name'] = df_cleaned['end_station_id'].apply(lambda x:
stations_dict.get(x))
Data Preparation#
Adding formatted columns (hour of ride, weekday, month) to further slice data
Calculating distance travlled (point-to-point)
# CALCULATING TIME RELATED DATA
df_cleaned['hour_of_ride'] = df_cleaned['started_at'].dt.hour
df_cleaned['day_of_ride'] = pd.Categorical(df_cleaned['started_at'].dt.day_name(),
categories= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
df_cleaned['month_of_ride'] = pd.Categorical(df_cleaned['started_at'].dt.month_name(),
categories= ['August', 'September', 'October', 'November', 'December',
'January', 'February', 'March', 'April', 'May', 'June', 'July'])
df_cleaned['year_of_ride'] = df_cleaned['started_at'].dt.year
# CALCULATING RIDE TIMES
df_cleaned['length_of_ride_tdelta'] = df_cleaned['ended_at'] - df_cleaned['started_at']
df_cleaned['length_of_ride_s'] = df_cleaned['length_of_ride_tdelta'].apply(lambda x: datetime.timedelta.total_seconds(x))
Calculating the travel distane (Point-to-Point)#
The haversine formula or haversine distance#
To calculate the distance on a sphere we can use the formula: $\( d = r * acos ( sin(\Phi_{1}) * sin(\Phi_{2}) + cos(\Phi_{1}) * cos(\Phi_{1}) * cos(\Delta\lambda)) \)$
Where:
r = radius of Sphere, Earth = ~6.371km
\(\Phi_{1}\) = Latitude of start point
\(\Phi_{2}\) = Latitude of end point
\(\Delta\lambda\) = Delta/Difference of longitude between end point and start point
Source: movable-type.co.uk
# DISTANCE Formula
def calc_sphere_dist(start_lat:float, start_lng:float, end_lat:float, end_lng:float, R:int|float=6371000):
"""Calculate distance between two spherical points using Haversine distane.
Takes coordinates in as angles and returns distance as default for Earth in m.
Args:
start_lat (float): Latitude Point 1 in angle
start_lng (float): Longitude Point 1 in angle
end_lat (float): Latitude Point 2 in angle
end_lng (float): Longitude Point 2 in angle
R (int or float, optional): Radius of the sphere. Defaults to 6371000 (Earth radius in m).
Returns:
float: Returns distance as default for Earth in [m].
"""
# CONVERT TO RADIANS
start_lat_deg = math.radians(start_lat)
start_lng_deg = math.radians(start_lng)
end_lat_deg = math.radians(end_lat)
end_lng_deg = math.radians(end_lng)
# APPLY DISTANCE FORMULA LIKE OUTLINED ABOVE
d = R * math.acos(
math.sin(start_lat_deg) * math.sin(end_lat_deg) +
math.cos(start_lat_deg) * math.cos(end_lat_deg) * math.cos(end_lng_deg - start_lng_deg)
)
return d
# CALCULATE RIDE DISTANCE (POINT-TO-POINT)
df_cleaned['dist_ride'] = df_cleaned.apply(lambda x:
calc_sphere_dist(x['start_lat'], x['start_lng'], x['end_lat'], x['end_lng']), axis = 1)
Data Analysis#
Get a feel for the dataset with calculating some high-level statistics:
Average, Median, Max and Min for the ride time as well as the ride distance
most commonly observed time of day, weekday and month of ride
# FUNCTION TO TURN TIMEDELTA INTO A TIME STRING (NOT BUILT IN FUNCTION)
# seen on https://stackoverflow.com/questions/8906926/formatting-timedelta-objects
from string import Template
def strfdelta(tdelta:datetime.timedelta, fmt:str="%H:%M:%S"):
"""Format a timedelta object into a time string like strf function
Args:
tdelta (timedelta object): Timedelta object to be converted to a time string
fmt (str, optional): Format string to specify desired output format. Default: "%H:%M:%S"
Returns:
str: Returns string in timeformat.
"""
class DeltaTemplate(Template):
delimiter = "%"
d = {"D": tdelta.days}
d["H"], rem = divmod(tdelta.seconds, 3600)
d["H"] = str(d["H"]).zfill(2)
d["M"], d["S"] = divmod(rem, 60)
d["M"] = str(d["M"]).zfill(2)
d["S"] = str(d["S"]).zfill(2)
t = DeltaTemplate(fmt)
return t.substitute(**d)
High Level Stats - 1#
# CALCULATE DESCRIPTIVE STATISTICS
avg_ride_timedelta = strfdelta(df_cleaned['length_of_ride_tdelta'].mean(), '%H:%M:%S')
median_ride_timedelta = strfdelta(df_cleaned['length_of_ride_tdelta'].median(), '%H:%M:%S')
max_ride_timedelta = strfdelta(df_cleaned['length_of_ride_tdelta'].max(), '%D days %H:%M:%S')
min_ride_timedelta = strfdelta(df_cleaned['length_of_ride_tdelta'].min(), '%D days %H:%M:%S')
avg_ride_dist = locale.format_string('%d', round(df_cleaned['dist_ride'].mean(), 2), grouping=True)
median_ride_dist = locale.format_string('%d',round(df_cleaned['dist_ride'].median(), 2), grouping=True)
max_ride_dist = locale.format_string('%d',round(df_cleaned['dist_ride'].max(), 2), grouping=True)
min_ride_dist = locale.format_string('%d',round(df_cleaned['dist_ride'].min(), 2), grouping=True)
most_common_hour = df_cleaned['hour_of_ride'].mode()[0]
most_common_day = df_cleaned['day_of_ride'].mode()[0]
most_common_month = df_cleaned['month_of_ride'].mode()[0]
most_common_start_station = df_cleaned['start_station_name'].mode()[0]
most_common_end_station = df_cleaned['end_station_name'].mode()[0]
dist_time_corr = df_cleaned['length_of_ride_s'].corr(df_cleaned['dist_ride'])
df_sorted = df_cleaned.sort_values(by=['dist_ride'], ascending=False)
count_rides = df_cleaned['ride_id'].count()
md(f"""The high-level statistics shows the following results:<br>
Number of observations: **{locale.format_string('%d', count_rides, grouping = True)}**
<br>
Correlation between time and distance is:
r = **{round(dist_time_corr,5)}**
<br>
| Variable | Average | Median | Max | Min |
|----------| :-: | :-: | :-: | :-: |
| ride length [t] | {avg_ride_timedelta} | {median_ride_timedelta} | {max_ride_timedelta} | {min_ride_timedelta} |
| ride distance [m] | {avg_ride_dist} | {median_ride_dist} | {max_ride_dist} | {min_ride_dist} |
<br>
| Variable | Mode (most common) |
| --- | :-: |
| hour of ride | {most_common_hour} |
| day of ride | {most_common_day} |
| month of ride | {most_common_month} |
| start station | {most_common_start_station} |
| end station | {most_common_end_station} |
<br>
The data shows that there is a Min value of ride time that is negative, indicating that the end time was before the start time. The Min value of ride dist also shows 0 distance travelled (although that was supposed to be filtered out by same Lat + Long before).
Also there seems to be a stark outlier with over {locale. format_string('%d', round(int(max_ride_dist.replace(".",""))/1000,1), grouping=True)}km travelled. On further inspection this was done with an electrik bike however the length of the bike ride was too short to make sense. Add new rule to filter out rides > 200km
Those data points are invalid and need to be filtered out before proceeding with more analysis.
""")
The high-level statistics shows the following results:
Number of observations: 4.394.034
Correlation between time and distance is:
r = 0.09955
| Variable | Average | Median | Max | Min | |----------| :-: | :-: | :-: | :-: | | ride length [t] | 00:17:39 | 00:11:06 | 28 days 21:49:10 | -1 days 21:50:55 | | ride distance [m] | 2.221 | 1.667 | 1.189.522 | 0 |
| Variable | Mode (most common) | | --- | :-: | | hour of ride | 17 | | day of ride | Saturday | | month of ride | August | | start station | Streeter Dr & Grand Ave | | end station | Streeter Dr & Grand Ave |
The data shows that there is a Min value of ride time that is negative, indicating that the end time was before the start time. The Min value of ride dist also shows 0 distance travelled (although that was supposed to be filtered out by same Lat + Long before).
Also there seems to be a stark outlier with over 1.189km travelled. On further inspection this was done with an electrik bike however the length of the bike ride was too short to make sense. Add new rule to filter out rides > 200km
Those data points are invalid and need to be filtered out before proceeding with more analysis.
Data Cleaning - Step 1#
Filter out rows with the following conditios:
length of travel as timedelta < 0 (negative) OR
ride distane in m is <= 0 OR
ride distance in m is > 200.000
# MORE CLEANING ACCORDING TO CRITERIA ABOVE
df_cleaned_v2 = df_cleaned.loc[(df_cleaned['dist_ride'] > 0) &
(df_cleaned['length_of_ride_tdelta']>datetime.timedelta(seconds=1)) &
(df_cleaned['dist_ride'] < 200000)]
md(f"""Rows drecreased by **{locale.format_string('%d', df_cleaned.shape[0]-df_cleaned_v2.shape[0], grouping = True)}**
from {locale.format_string('%d', df_cleaned.shape[0], grouping = True)}
to {locale.format_string('%d', df_cleaned_v2.shape[0], grouping = True)}.""")
Rows drecreased by 315 from 4.394.034 to 4.393.719.
High Level Stats - 2#
# RE-CALCULATE DESCRIPTIVE STAISTICS
avg_ride_timedelta_v2 = strfdelta(df_cleaned_v2['length_of_ride_tdelta'].mean(), '%H:%M:%S')
median_ride_timedelta_v2 = strfdelta(df_cleaned_v2['length_of_ride_tdelta'].median(), '%H:%M:%S')
max_ride_timedelta_v2 = strfdelta(df_cleaned_v2['length_of_ride_tdelta'].max(), '%D days %H:%M:%S')
min_ride_timedelta_v2 = strfdelta(df_cleaned_v2['length_of_ride_tdelta'].min(), '%D days %H:%M:%S')
avg_ride_dist_v2 = locale.format_string('%d', round(df_cleaned_v2['dist_ride'].mean(), 2), grouping=True)
median_ride_dist_v2 = locale.format_string('%d',round(df_cleaned_v2['dist_ride'].median(), 2), grouping=True)
max_ride_dist_v2 = locale.format_string('%d',round(df_cleaned_v2['dist_ride'].max(), 2), grouping=True)
min_ride_dist_v2 = locale.format_string('%d',round(df_cleaned_v2['dist_ride'].min(), 2), grouping=True)
most_common_hour_v2 = df_cleaned_v2['hour_of_ride'].mode()[0]
most_common_day_v2 = df_cleaned_v2['day_of_ride'].mode()[0]
most_common_month_v2 = df_cleaned_v2['month_of_ride'].mode()[0]
most_common_start_station_v2 = df_cleaned_v2['start_station_name'].mode()[0]
most_common_end_station_v2 = df_cleaned_v2['end_station_name'].mode()[0]
dist_time_corr_v2 = df_cleaned_v2['length_of_ride_s'].corr(df_cleaned_v2['dist_ride'])
df_max_ride_dist = df_cleaned_v2.loc[(df_cleaned_v2['dist_ride'] == df_cleaned_v2['dist_ride'].max())]
count_rides_v2 = df_cleaned_v2['ride_id'].count()
md(f"""The high-level statistics shows the following results:<br>
Number of observations: {locale.format_string('%d', count_rides_v2, grouping = True)}
<br>
Correlation between time and distance is:
r = {round(dist_time_corr_v2,5)}
| Variable | Average | Median | Max | Min |
|----------| :-: | :-: | :-: | :-: |
| ride length [t] | {avg_ride_timedelta_v2} | {median_ride_timedelta_v2} | {max_ride_timedelta_v2} | {min_ride_timedelta_v2} |
| ride distance [m] | {avg_ride_dist_v2} | {median_ride_dist_v2} | {max_ride_dist_v2} | {min_ride_dist_v2} |
<br>
| Time of ride | Mode (most common) |
| --- | :-: |
| hour of ride | {most_common_hour_v2} |
| day of ride | {most_common_day_v2} |
| month of ride | {most_common_month_v2} |
| start station | {most_common_start_station_v2} |
| end station | {most_common_end_station_v2} |
Although there all rides with same start coordinates have been filtered out, there are still rides with no or very little
distance travelled. Also there are rides that only last a few seconds. To get a beter understanding of "real" rides, e.g. rides
that are used to travel somewhere and are not "accidentally" unlocked, a new rule will be applied: rides have to be farther than 100m
and longer than 1 min.
""")
The high-level statistics shows the following results:
Number of observations: 4.393.719
Correlation between time and distance is:
r = 0.10411
| Variable | Average | Median | Max | Min | |----------| :-: | :-: | :-: | :-: | | ride length [t] | 00:17:39 | 00:11:06 | 28 days 21:49:10 | 0 days 00:00:02 | | ride distance [m] | 2.221 | 1.667 | 31.906 | 0 |
| Time of ride | Mode (most common) | | --- | :-: | | hour of ride | 17 | | day of ride | Saturday | | month of ride | August | | start station | Streeter Dr & Grand Ave | | end station | Streeter Dr & Grand Ave |
Although there all rides with same start coordinates have been filtered out, there are still rides with no or very little distance travelled. Also there are rides that only last a few seconds. To get a beter understanding of "real" rides, e.g. rides that are used to travel somewhere and are not "accidentally" unlocked, a new rule will be applied: rides have to be farther than 100m and longer than 1 min.
Data Cleaning - Step 2#
# NARROWING DOWN DATA SET TO MATCH ACTUAL DEMAND DIST > 100m and RIDE TIME > 1 MIN
df_cleaned_v3 = df_cleaned_v2.loc[(df_cleaned_v2['dist_ride'] > 100) &
(df_cleaned_v2['length_of_ride_tdelta']>datetime.timedelta(minutes=1))]
md(f"""Rows drecreased by **{locale.format_string('%d', df_cleaned_v2.shape[0]-df_cleaned_v3.shape[0], grouping = True)}**
from {locale.format_string('%d', df_cleaned_v2.shape[0], grouping = True)}
to {locale.format_string('%d', df_cleaned_v3.shape[0], grouping = True)}.<br>
In total Rows decreased by **{locale.format_string('%d', df.shape[0]-df_cleaned_v3.shape[0], grouping = True)}** from original dataset.
""")
Rows drecreased by 78.917
from 4.393.719
to 4.314.802.
In total Rows decreased by 1.586.661 from original dataset.
# CREATE SUBCATEGORY FOR DIST TRAVLLED FROM ULTRA-SHORT TO EXTRA-LONG
def map_distance(x):
map = {'< 500': 'Short', '< 1000': 'Short-Medium', '< 2500': 'Medium',
'< 5000': 'Medium-Long', '<10000': 'Long', '>= 10000': 'Extra-Long'}
for key, value in map.items():
if eval(f'{x} {key}'):
return value
break
df_cleaned_v3['dist_class'] = pd.Categorical(df_cleaned_v3['dist_ride'].apply(lambda x:map_distance(x)),
categories=['Short','Short-Medium', 'Medium', 'Medium-Long', 'Long', 'Extra-Long'])
C:\Users\kemke\AppData\Local\Temp\ipykernel_9028\784846940.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# EpxORTING CLEANED DATA TO CSV
df_cleaned_v3.to_csv('.\..\data\cleaned_data.csv', float_format="%.2f", index=False)
IMPROT DF for quick restart and PURGE OBSOLETE DATA#
High Level Stats - 3#
# RE-CALCULATE DESCRIPTIVE STAISTICS
avg_ride_timedelta_v3 = strfdelta(df_cleaned_v3['length_of_ride_tdelta'].mean(), '%H:%M:%S')
median_ride_timedelta_v3 = strfdelta(df_cleaned_v3['length_of_ride_tdelta'].median(), '%H:%M:%S')
max_ride_timedelta_v3 = strfdelta(df_cleaned_v3['length_of_ride_tdelta'].max(), '%D days %H:%M:%S')
min_ride_timedelta_v3 = strfdelta(df_cleaned_v3['length_of_ride_tdelta'].min(), '%D days %H:%M:%S')
avg_ride_dist_v3 = locale.format_string('%d', round(df_cleaned_v3['dist_ride'].mean(), 2), grouping=True)
median_ride_dist_v3 = locale.format_string('%d',round(df_cleaned_v3['dist_ride'].median(), 2), grouping=True)
max_ride_dist_v3 = locale.format_string('%d',round(df_cleaned_v3['dist_ride'].max(), 2), grouping=True)
min_ride_dist_v3 = locale.format_string('%d',round(df_cleaned_v3['dist_ride'].min(), 2), grouping=True)
most_common_hour_v3 = df_cleaned_v3['hour_of_ride'].mode()[0]
most_common_day_v3 = df_cleaned_v3['day_of_ride'].mode()[0]
most_common_month_v3 = df_cleaned_v3['month_of_ride'].mode()[0]
most_common_start_station_v3 = df_cleaned_v3['start_station_name'].mode()[0]
most_common_end_station_v3 = df_cleaned_v3['end_station_name'].mode()[0]
dist_time_corr_v3 = df_cleaned_v3['length_of_ride_s'].corr(df_cleaned_v3['dist_ride'])
count_rides_v3= df_cleaned_v3['ride_id'].count()
md(f"""The high-level statistics shows the following results:
<br>
Number of observations: **{locale.format_string('%d', count_rides_v3, grouping = True)}**
<br>
Correlation between time and distance is:
r = **{round(dist_time_corr_v3,5)}**
| Variable | Average | Median | Max | Min |
|----------| :-: | :-: | :-: | :-: |
| ride length [t] | {avg_ride_timedelta_v3} | {median_ride_timedelta_v3} | {max_ride_timedelta_v3} | {min_ride_timedelta_v3} |
| ride distance [m] | {avg_ride_dist_v3} | {median_ride_dist_v3} | {max_ride_dist_v3} | {min_ride_dist_v3} |
<br>
| Time of ride | Mode (most common) |
| --- | :-: |
| hour of ride | {most_common_hour_v3} |
| day of ride | {most_common_day_v3} |
| month of ride | {most_common_month_v3} |
| start station | {most_common_start_station_v3} |
| end station | {most_common_end_station_v3} |
""")
The high-level statistics shows the following results:
Number of observations: 4.314.802
Correlation between time and distance is:
r = 0.10584
| Variable | Average | Median | Max | Min | |----------| :-: | :-: | :-: | :-: | | ride length [t] | 00:17:38 | 00:11:09 | 28 days 21:49:10 | 0 days 00:01:01 | | ride distance [m] | 2.261 | 1.697 | 31.906 | 100 |
| Time of ride | Mode (most common) | | --- | :-: | | hour of ride | 17 | | day of ride | Saturday | | month of ride | August | | start station | Streeter Dr & Grand Ave | | end station | Streeter Dr & Grand Ave |
Deep Dive - Statistics via Pivot#
Taking a deeper look into the cleaned data and aggregating the data into different pivots
Pivot - Member-Type#
# GROUPING STATISTICS
df_groupby_member = df_cleaned_v3.groupby('member_casual')\
.apply(lambda df: pd.Series({
'count_of_rides': locale.format_string('%d', df['ride_id'].count(), grouping = True),
'r' : round(df['length_of_ride_s'].corr(df['dist_ride']), 5),
'common_hour': df['hour_of_ride'].mode()[0],
'most_common_day': df['day_of_ride'].mode()[0],
'most_common_month': df['month_of_ride'].mode()[0],
'avg_time': strfdelta(df['length_of_ride_tdelta'].mean(), '%H:%M:%S'),
'median_time': strfdelta(df['length_of_ride_tdelta'].median(), '%H:%M:%S'),
'std_time': strfdelta(df['length_of_ride_tdelta'].std(), '%H:%M:%S'),
'max_time': df['length_of_ride_tdelta'].max(),
'min_time': strfdelta(df['length_of_ride_tdelta'].min(), '%H:%M:%S'),
'avg_dist': round(df['dist_ride'].mean(), 0),
'median_dist': round(df['dist_ride'].median(), 0),
'std_dist': round(df['dist_ride'].std(), 0),
'max_dist': round(df['dist_ride'].max(), 0),
'min_dist': round(df['dist_ride'].min(), 0),
}))
df_groupby_member
| count_of_rides | r | common_hour | most_common_day | most_common_month | avg_time | median_time | std_time | max_time | min_time | avg_dist | median_dist | std_dist | max_dist | min_dist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| member_casual | |||||||||||||||
| Casual | 1.753.742 | 0.07688 | 17 | Saturday | August | 00:25:03 | 00:14:45 | 02:06:47 | 28 days 21:49:10 | 00:01:01 | 2437.0 | 1884.0 | 1934.0 | 31906.0 | 100.0 |
| Member | 2.561.060 | 0.41390 | 17 | Tuesday | August | 00:12:33 | 00:09:15 | 00:17:12 | 1 days 00:52:55 | 00:01:01 | 2142.0 | 1582.0 | 1776.0 | 27488.0 | 100.0 |
Pivot - Rideable-Type#
# GROUPING STATISTICS
df_groupby_ride = df_cleaned_v3.groupby('rideable_type') \
.apply(lambda df: pd.Series({
'count_of_rides': locale.format_string('%d', df['ride_id'].count(), grouping = True),
'r' : round(df['length_of_ride_s'].corr(df['dist_ride']), 5),
'common_hour': df['hour_of_ride'].mode()[0],
'most_common_day': df['day_of_ride'].mode()[0],
'avg_time': strfdelta(df['length_of_ride_tdelta'].mean(), '%H:%M:%S'),
'median_time': strfdelta(df['length_of_ride_tdelta'].median(), '%H:%M:%S'),
'std_time': strfdelta(df['length_of_ride_tdelta'].std(), '%H:%M:%S'),
'max_time': df['length_of_ride_tdelta'].max(),
'min_time': strfdelta(df['length_of_ride_tdelta'].min(), '%H:%M:%S'),
'avg_dist': round(df['dist_ride'].mean(), 0),
'median_dist': round(df['dist_ride'].median(), 0),
'std_dist': round(df['dist_ride'].std(), 0),
'max_dist': round(df['dist_ride'].max(), 0),
'min_dist': round(df['dist_ride'].min(), 0),
}))
df_groupby_ride
| count_of_rides | r | common_hour | most_common_day | avg_time | median_time | std_time | max_time | min_time | avg_dist | median_dist | std_dist | max_dist | min_dist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rideable_type | ||||||||||||||
| Classic Bike | 2.863.190 | 0.29272 | 17 | Saturday | 00:16:47 | 00:11:12 | 00:29:30 | 1 days 00:59:25 | 00:01:01 | 2137.0 | 1608.0 | 1739.0 | 30186.0 | 102.0 |
| Docked Bike | 184.788 | 0.04627 | 16 | Saturday | 00:55:20 | 00:25:51 | 06:15:34 | 28 days 21:49:10 | 00:01:10 | 2639.0 | 1973.0 | 2244.0 | 31906.0 | 102.0 |
| Electric Bike | 1.266.824 | 0.46330 | 17 | Thursday | 00:14:03 | 00:10:00 | 00:14:44 | 0 days 07:59:55 | 00:01:01 | 2489.0 | 1899.0 | 1987.0 | 28741.0 | 100.0 |
Pivot - Month#
# GROUPING STATISTICS
df_groupby_month = df_cleaned_v3.groupby('month_of_ride')\
.apply(lambda df: pd.Series({
'count_of_rides': locale.format_string('%d', df['ride_id'].count(), grouping = True),
'r' : round(df['length_of_ride_s'].corr(df['dist_ride']), 5),
'common_hour': df['hour_of_ride'].mode()[0],
'most_common_day': df['day_of_ride'].mode()[0],
'avg_time': strfdelta(df['length_of_ride_tdelta'].mean(), '%H:%M:%S'),
'median_time': strfdelta(df['length_of_ride_tdelta'].median(), '%H:%M:%S'),
'std_time': strfdelta(df['length_of_ride_tdelta'].std(), '%H:%M:%S'),
'max_time': df['length_of_ride_tdelta'].max(),
'min_time': strfdelta(df['length_of_ride_tdelta'].min(), '%H:%M:%S'),
'avg_dist': round(df['dist_ride'].mean(), 0),
'median_dist': round(df['dist_ride'].median(), 0),
'std_dist': round(df['dist_ride'].std(), 0),
'max_dist': round(df['dist_ride'].max(), 0),
'min_dist': round(df['dist_ride'].min(), 0),
}))
df_groupby_month
| count_of_rides | r | common_hour | most_common_day | avg_time | median_time | std_time | max_time | min_time | avg_dist | median_dist | std_dist | max_dist | min_dist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| month_of_ride | ||||||||||||||
| August | 627.257 | 0.09377 | 17 | Sunday | 00:19:57 | 00:12:37 | 01:33:06 | 28 days 21:49:10 | 00:01:01 | 2357.0 | 1808.0 | 1878.0 | 31906.0 | 100.0 |
| September | 580.808 | 0.08451 | 17 | Saturday | 00:19:12 | 00:11:59 | 01:53:41 | 22 days 19:38:32 | 00:01:01 | 2331.0 | 1769.0 | 1879.0 | 29389.0 | 100.0 |
| October | 449.826 | 0.08579 | 17 | Saturday | 00:16:58 | 00:10:33 | 01:40:30 | 28 days 06:25:01 | 00:01:01 | 2190.0 | 1641.0 | 1792.0 | 28915.0 | 100.0 |
| November | 242.781 | 0.10864 | 17 | Tuesday | 00:13:35 | 00:08:51 | 01:08:55 | 13 days 08:37:49 | 00:01:01 | 2002.0 | 1492.0 | 1662.0 | 23446.0 | 101.0 |
| December | 167.342 | 0.06369 | 17 | Thursday | 00:13:45 | 00:08:34 | 02:10:13 | 21 days 02:40:33 | 00:01:01 | 1976.0 | 1480.0 | 1642.0 | 22524.0 | 101.0 |
| January | 76.473 | 0.05098 | 17 | Thursday | 00:12:36 | 00:07:48 | 02:43:07 | 20 days 07:51:06 | 00:01:01 | 1780.0 | 1325.0 | 1487.0 | 23263.0 | 102.0 |
| February | 84.707 | 0.11718 | 17 | Monday | 00:12:35 | 00:08:04 | 00:58:38 | 7 days 13:45:58 | 00:01:01 | 1855.0 | 1357.0 | 1579.0 | 24700.0 | 102.0 |
| March | 202.153 | 0.09013 | 17 | Wednesday | 00:16:07 | 00:09:38 | 01:37:50 | 23 days 20:34:04 | 00:01:01 | 2134.0 | 1581.0 | 1793.0 | 29734.0 | 102.0 |
| April | 255.224 | 0.22528 | 17 | Saturday | 00:15:37 | 00:09:48 | 00:36:39 | 5 days 05:45:58 | 00:01:01 | 2148.0 | 1598.0 | 1792.0 | 21637.0 | 100.0 |
| May | 462.972 | 0.23225 | 17 | Monday | 00:18:34 | 00:11:54 | 00:37:05 | 7 days 10:42:58 | 00:01:01 | 2343.0 | 1760.0 | 1917.0 | 28036.0 | 101.0 |
| June | 573.835 | 0.25961 | 17 | Thursday | 00:18:12 | 00:12:08 | 00:32:01 | 4 days 15:12:52 | 00:01:01 | 2370.0 | 1809.0 | 1897.0 | 30186.0 | 100.0 |
| July | 591.424 | 0.15218 | 17 | Saturday | 00:18:22 | 00:12:03 | 00:54:56 | 22 days 05:55:27 | 00:01:01 | 2381.0 | 1802.0 | 1921.0 | 29685.0 | 101.0 |
Pivot - Member-Type & Month#
# GROUPING STATISTICS
df_groupby_member_month = df_cleaned_v3.groupby(['member_casual','month_of_ride']) \
.apply(lambda df: pd.Series({
'count_of_rides': locale.format_string('%d', df['ride_id'].count(), grouping = True),
'r' : round(df['length_of_ride_s'].corr(df['dist_ride']), 5),
'common_hour': df['hour_of_ride'].mode()[0],
'most_common_day': df['day_of_ride'].mode()[0],
'avg_time': strfdelta(df['length_of_ride_tdelta'].mean(), '%H:%M:%S'),
'median_time': strfdelta(df['length_of_ride_tdelta'].median(), '%H:%M:%S'),
'std_time': strfdelta(df['length_of_ride_tdelta'].std(), '%H:%M:%S'),
'max_time': df['length_of_ride_tdelta'].max(),
'min_time': strfdelta(df['length_of_ride_tdelta'].min(), '%H:%M:%S'),
'avg_dist': round(df['dist_ride'].mean(), 0),
'median_dist': round(df['dist_ride'].median(), 0),
'std_dist': round(df['dist_ride'].std(), 0),
'max_dist': round(df['dist_ride'].max(), 0),
'min_dist': round(df['dist_ride'].min(), 0),
}))
df_groupby_member_month
| count_of_rides | r | common_hour | most_common_day | avg_time | median_time | std_time | max_time | min_time | avg_dist | median_dist | std_dist | max_dist | min_dist | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| member_casual | month_of_ride | ||||||||||||||
| Casual | August | 308.612 | 0.07084 | 17 | Saturday | 00:26:39 | 00:15:52 | 02:11:25 | 28 days 21:49:10 | 00:01:01 | 2483.0 | 1937.0 | 1937.0 | 31906.0 | 100.0 |
| September | 266.430 | 0.06643 | 17 | Saturday | 00:26:27 | 00:15:15 | 02:46:39 | 22 days 19:38:32 | 00:01:01 | 2491.0 | 1933.0 | 1953.0 | 29389.0 | 102.0 | |
| October | 173.455 | 0.06243 | 17 | Saturday | 00:24:57 | 00:14:01 | 02:40:22 | 28 days 06:25:01 | 00:01:02 | 2390.0 | 1850.0 | 1883.0 | 28915.0 | 100.0 | |
| November | 64.296 | 0.07135 | 17 | Saturday | 00:21:01 | 00:11:35 | 02:10:44 | 13 days 08:37:49 | 00:01:01 | 2178.0 | 1666.0 | 1718.0 | 23446.0 | 102.0 | |
| December | 41.098 | 0.05152 | 16 | Friday | 00:23:41 | 00:11:14 | 04:21:22 | 21 days 02:40:33 | 00:01:06 | 2116.0 | 1624.0 | 1690.0 | 22524.0 | 102.0 | |
| January | 11.539 | 0.04988 | 17 | Saturday | 00:26:08 | 00:09:55 | 06:58:20 | 20 days 07:51:06 | 00:01:05 | 1900.0 | 1495.0 | 1438.0 | 15667.0 | 102.0 | |
| February | 13.738 | 0.06506 | 17 | Monday | 00:22:41 | 00:10:52 | 02:18:54 | 7 days 13:45:58 | 00:01:03 | 2083.0 | 1597.0 | 1707.0 | 24700.0 | 103.0 | |
| March | 59.337 | 0.06383 | 17 | Sunday | 00:26:42 | 00:14:34 | 02:56:55 | 23 days 20:34:04 | 00:01:03 | 2397.0 | 1828.0 | 1950.0 | 29734.0 | 102.0 | |
| April | 82.006 | 0.16807 | 17 | Saturday | 00:24:14 | 00:14:24 | 00:56:10 | 5 days 05:45:58 | 00:01:01 | 2419.0 | 1850.0 | 1952.0 | 21637.0 | 101.0 | |
| May | 194.465 | 0.17729 | 17 | Sunday | 00:25:55 | 00:15:48 | 00:52:20 | 7 days 10:42:58 | 00:01:01 | 2501.0 | 1917.0 | 2017.0 | 28036.0 | 101.0 | |
| June | 261.320 | 0.20812 | 17 | Saturday | 00:23:37 | 00:14:51 | 00:42:25 | 4 days 15:12:52 | 00:01:01 | 2457.0 | 1908.0 | 1941.0 | 30186.0 | 100.0 | |
| July | 277.446 | 0.11493 | 17 | Saturday | 00:23:51 | 00:14:44 | 01:17:58 | 22 days 05:55:27 | 00:01:01 | 2459.0 | 1894.0 | 1962.0 | 29685.0 | 102.0 | |
| Member | August | 318.645 | 0.45973 | 17 | Tuesday | 00:13:27 | 00:10:08 | 00:15:51 | 1 days 00:46:23 | 00:01:01 | 2235.0 | 1673.0 | 1810.0 | 26344.0 | 101.0 |
| September | 314.378 | 0.45343 | 17 | Thursday | 00:13:03 | 00:09:48 | 00:16:10 | 0 days 21:58:24 | 00:01:01 | 2196.0 | 1632.0 | 1803.0 | 25227.0 | 100.0 | |
| October | 276.371 | 0.44390 | 17 | Saturday | 00:11:58 | 00:08:54 | 00:15:18 | 0 days 23:35:08 | 00:01:01 | 2065.0 | 1520.0 | 1720.0 | 23432.0 | 101.0 | |
| November | 178.485 | 0.38527 | 17 | Tuesday | 00:10:54 | 00:08:01 | 00:16:40 | 1 days 00:20:34 | 00:01:01 | 1939.0 | 1409.0 | 1636.0 | 22073.0 | 101.0 | |
| December | 126.244 | 0.43902 | 17 | Thursday | 00:10:31 | 00:07:51 | 00:14:05 | 0 days 20:30:52 | 00:01:01 | 1931.0 | 1418.0 | 1624.0 | 20355.0 | 101.0 | |
| January | 64.934 | 0.41840 | 17 | Thursday | 00:10:12 | 00:07:29 | 00:14:13 | 0 days 23:00:00 | 00:01:01 | 1758.0 | 1289.0 | 1495.0 | 23263.0 | 102.0 | |
| February | 70.969 | 0.32736 | 17 | Monday | 00:10:38 | 00:07:37 | 00:18:36 | 1 days 00:26:24 | 00:01:01 | 1811.0 | 1314.0 | 1549.0 | 21213.0 | 102.0 | |
| March | 142.816 | 0.30966 | 17 | Wednesday | 00:11:43 | 00:08:16 | 00:21:52 | 1 days 00:38:29 | 00:01:01 | 2024.0 | 1479.0 | 1712.0 | 25216.0 | 102.0 | |
| April | 173.218 | 0.32382 | 17 | Tuesday | 00:11:33 | 00:08:17 | 00:20:51 | 1 days 00:52:55 | 00:01:01 | 2020.0 | 1479.0 | 1696.0 | 21232.0 | 100.0 | |
| May | 268.507 | 0.41082 | 17 | Monday | 00:13:14 | 00:09:44 | 00:17:53 | 1 days 00:35:38 | 00:01:01 | 2229.0 | 1641.0 | 1833.0 | 23343.0 | 101.0 | |
| June | 312.515 | 0.40993 | 17 | Thursday | 00:13:40 | 00:10:14 | 00:18:17 | 0 days 22:03:32 | 00:01:01 | 2297.0 | 1711.0 | 1855.0 | 24669.0 | 102.0 | |
| July | 313.978 | 0.45942 | 17 | Saturday | 00:13:31 | 00:10:08 | 00:16:11 | 1 days 00:36:42 | 00:01:01 | 2312.0 | 1714.0 | 1882.0 | 27488.0 | 101.0 |
Pivot - Member-Type & Day#
# GROUPING STATISTICS
df_groupby_member_day = df_cleaned_v3.groupby(['member_casual','day_of_ride']) \
.apply(lambda df: pd.Series({
'count_of_rides': locale.format_string('%d', df['ride_id'].count(), grouping = True),
'r' : round(df['length_of_ride_s'].corr(df['dist_ride']), 5),
'common_hour': df['hour_of_ride'].mode()[0],
'avg_time': strfdelta(df['length_of_ride_tdelta'].mean(), '%H:%M:%S'),
'median_time': strfdelta(df['length_of_ride_tdelta'].median(), '%H:%M:%S'),
'std_time': strfdelta(df['length_of_ride_tdelta'].std(), '%H:%M:%S'),
'max_time': df['length_of_ride_tdelta'].max(),
'min_time': strfdelta(df['length_of_ride_tdelta'].min(), '%H:%M:%S'),
'avg_dist': round(df['dist_ride'].mean(), 0),
'median_dist': round(df['dist_ride'].median(), 0),
'std_dist': round(df['dist_ride'].std(), 0),
'max_dist': round(df['dist_ride'].max(), 0),
'min_dist': round(df['dist_ride'].min(), 0),
}))
df_groupby_member_day
| count_of_rides | r | common_hour | avg_time | median_time | std_time | max_time | min_time | avg_dist | median_dist | std_dist | max_dist | min_dist | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| member_casual | day_of_ride | |||||||||||||
| Casual | Monday | 202.460 | 0.08057 | 17 | 00:25:30 | 00:14:39 | 02:03:54 | 22 days 05:55:27 | 00:01:02 | 2380.0 | 1828.0 | 1923.0 | 29038.0 | 100.0 |
| Tuesday | 185.023 | 0.11569 | 17 | 00:21:31 | 00:12:40 | 01:15:14 | 10 days 18:02:58 | 00:01:01 | 2319.0 | 1785.0 | 1843.0 | 28036.0 | 100.0 | |
| Wednesday | 191.032 | 0.07499 | 17 | 00:21:15 | 00:12:41 | 01:57:59 | 21 days 02:40:33 | 00:01:01 | 2323.0 | 1807.0 | 1816.0 | 31906.0 | 100.0 | |
| Thursday | 215.972 | 0.06820 | 17 | 00:22:01 | 00:13:00 | 02:02:42 | 18 days 19:22:48 | 00:01:01 | 2354.0 | 1827.0 | 1850.0 | 29389.0 | 101.0 | |
| Friday | 240.037 | 0.07902 | 17 | 00:23:26 | 00:13:55 | 01:52:18 | 15 days 17:09:54 | 00:01:01 | 2359.0 | 1825.0 | 1846.0 | 30186.0 | 100.0 | |
| Saturday | 381.328 | 0.08328 | 15 | 00:27:47 | 00:16:55 | 02:12:40 | 28 days 06:25:01 | 00:01:01 | 2565.0 | 1999.0 | 2019.0 | 29685.0 | 102.0 | |
| Sunday | 337.890 | 0.06255 | 15 | 00:28:54 | 00:17:08 | 02:37:23 | 28 days 21:49:10 | 00:01:02 | 2561.0 | 1980.0 | 2050.0 | 29734.0 | 101.0 | |
| Member | Monday | 360.537 | 0.42844 | 17 | 00:12:06 | 00:08:54 | 00:16:11 | 1 days 00:52:55 | 00:01:01 | 2109.0 | 1538.0 | 1765.0 | 25227.0 | 101.0 |
| Tuesday | 403.168 | 0.42333 | 17 | 00:11:42 | 00:08:45 | 00:16:00 | 1 days 00:22:02 | 00:01:01 | 2090.0 | 1527.0 | 1752.0 | 23330.0 | 100.0 | |
| Wednesday | 401.858 | 0.41321 | 17 | 00:11:50 | 00:08:53 | 00:16:22 | 1 days 00:20:34 | 00:01:01 | 2110.0 | 1554.0 | 1753.0 | 24669.0 | 101.0 | |
| Thursday | 397.578 | 0.40522 | 17 | 00:12:01 | 00:08:57 | 00:17:02 | 1 days 00:35:38 | 00:01:01 | 2116.0 | 1555.0 | 1763.0 | 26344.0 | 101.0 | |
| Friday | 349.014 | 0.40803 | 17 | 00:12:13 | 00:09:03 | 00:16:52 | 1 days 00:15:18 | 00:01:01 | 2092.0 | 1542.0 | 1739.0 | 27488.0 | 101.0 | |
| Saturday | 337.896 | 0.40333 | 12 | 00:14:14 | 00:10:28 | 00:19:26 | 0 days 23:37:03 | 00:01:01 | 2260.0 | 1699.0 | 1827.0 | 25224.0 | 101.0 | |
| Sunday | 311.009 | 0.41339 | 15 | 00:14:17 | 00:10:23 | 00:18:39 | 1 days 00:46:23 | 00:01:01 | 2253.0 | 1681.0 | 1838.0 | 25216.0 | 102.0 |
Pivot - Member-Type & Ride-Type & Month#
# GROUPING STATISTICS
df_groupby_member_ride_month = df_cleaned_v3.groupby(['member_casual',
'rideable_type','month_of_ride']).apply(lambda df: pd.Series({
'count_of_rides': locale.format_string('%d', df['ride_id'].count(), grouping = True),
'r' : round(df['length_of_ride_s'].corr(df['dist_ride']), 5),
'common_hour': df['hour_of_ride'].mode()[0],
'avg_time': strfdelta(df['length_of_ride_tdelta'].mean(), '%H:%M:%S'),
'median_time': strfdelta(df['length_of_ride_tdelta'].median(), '%H:%M:%S'),
'std_time': strfdelta(df['length_of_ride_tdelta'].std(), '%H:%M:%S'),
'max_time': df['length_of_ride_tdelta'].max(),
'min_time': strfdelta(df['length_of_ride_tdelta'].min(), '%H:%M:%S'),
'avg_dist': round(df['dist_ride'].mean(), 0),
'median_dist': round(df['dist_ride'].median(), 0),
'std_dist': round(df['dist_ride'].std(), 0),
'max_dist': round(df['dist_ride'].max(), 0),
'min_dist': round(df['dist_ride'].min(), 0),
}))
df_groupby_member_ride_month
| count_of_rides | r | common_hour | avg_time | median_time | std_time | max_time | min_time | avg_dist | median_dist | std_dist | max_dist | min_dist | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| member_casual | rideable_type | month_of_ride | |||||||||||||
| Casual | Classic Bike | August | 209.947 | 0.24366 | 17.0 | 00:23:29 | 00:15:24 | 00:36:59 | 1 days 00:54:07 | 00:01:01 | 2379.0 | 1885.0 | 1804.0 | 28476.0 | 102.0 |
| September | 179.184 | 0.22760 | 17.0 | 00:23:06 | 00:14:53 | 00:39:39 | 1 days 00:57:57 | 00:01:01 | 2375.0 | 1871.0 | 1814.0 | 29389.0 | 102.0 | ||
| October | 96.925 | 0.22574 | 17.0 | 00:22:36 | 00:14:05 | 00:41:09 | 1 days 00:49:22 | 00:01:02 | 2254.0 | 1747.0 | 1745.0 | 28915.0 | 102.0 | ||
| November | 29.143 | 0.20890 | 15.0 | 00:19:50 | 00:12:05 | 00:39:43 | 0 days 23:14:55 | 00:01:01 | 2036.0 | 1582.0 | 1593.0 | 21024.0 | 102.0 | ||
| December | 17.891 | 0.18177 | 14.0 | 00:20:52 | 00:11:53 | 00:46:24 | 1 days 00:12:59 | 00:01:06 | 1951.0 | 1531.0 | 1573.0 | 22524.0 | 103.0 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Member | Electric Bike | March | 47.855 | 0.47299 | 17.0 | 00:10:31 | 00:07:47 | 00:12:45 | 0 days 07:58:32 | 00:01:01 | 2276.0 | 1689.0 | 1872.0 | 25216.0 | 103.0 |
| April | 58.940 | 0.58292 | 17.0 | 00:10:27 | 00:07:54 | 00:10:27 | 0 days 07:20:46 | 00:01:01 | 2290.0 | 1714.0 | 1843.0 | 18663.0 | 100.0 | ||
| May | 79.763 | 0.59476 | 17.0 | 00:12:08 | 00:09:15 | 00:11:08 | 0 days 07:45:51 | 00:01:01 | 2508.0 | 1912.0 | 1996.0 | 23343.0 | 101.0 | ||
| June | 86.882 | 0.62357 | 17.0 | 00:12:13 | 00:09:30 | 00:11:00 | 0 days 05:47:04 | 00:01:01 | 2589.0 | 1986.0 | 2037.0 | 24669.0 | 103.0 | ||
| July | 107.186 | 0.65341 | 17.0 | 00:12:00 | 00:09:23 | 00:10:26 | 0 days 07:05:03 | 00:01:01 | 2550.0 | 1956.0 | 2007.0 | 26003.0 | 101.0 |
72 rows × 13 columns
Pivot - Member-Type & most used Station#
station_concat = pd.concat([df_cleaned_v3[['member_casual','start_station_id','start_station_name','day_of_ride','month_of_ride','rideable_type']]\
.rename(columns={'start_station_id':'station_id', 'start_station_name': 'station_name'}),
df_cleaned_v3[['member_casual','end_station_id','end_station_name','day_of_ride','month_of_ride','rideable_type']]\
.rename(columns={'end_station_id':'station_id', 'end_station_name': 'station_name'})])
most_common_station = station_concat.groupby(['member_casual','station_id', 'station_name'])['station_id'].count().\
rename('no_of_rides').sort_values(ascending=False)
dfs = []
for idx, temp_df in most_common_station.groupby(level=0):
dfs.append(temp_df.nlargest(5))
most_common_station = pd.concat(dfs).to_frame()
most_common_station
| no_of_rides | |||
|---|---|---|---|
| member_casual | station_id | station_name | |
| Casual | 13022 | Streeter Dr & Grand Ave | 101345 |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 50795 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 48169 | |
| 13008 | Millennium Park | 47821 | |
| 13042 | Michigan Ave & Oak St | 45227 | |
| Member | KA1503000043 | Kingsbury St & Kinzie St | 49100 |
| TA1308000050 | Wells St & Concord Ln | 43748 | |
| TA1307000039 | Clark St & Elm St | 43500 | |
| KA1504000135 | Wells St & Elm St | 38061 | |
| WL-012 | Clinton St & Washington Blvd | 37214 |
Pivot - Member-Type & Month & most used Station#
df_temp = station_concat.groupby(['member_casual','month_of_ride','station_id','station_name'])['station_id'].count().\
rename('no_of_rides').sort_values(ascending=False)
dfs = []
for idx, temp_df in df_temp.groupby(level=[0,1]):
dfs.append(temp_df.nlargest(2))
most_common_station_month = pd.concat(dfs).to_frame()
most_common_station_month
| no_of_rides | ||||
|---|---|---|---|---|
| member_casual | month_of_ride | station_id | station_name | |
| Casual | August | 13022 | Streeter Dr & Grand Ave | 18915 |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 11584 | ||
| September | 13022 | Streeter Dr & Grand Ave | 15179 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 7954 | ||
| October | 13022 | Streeter Dr & Grand Ave | 7980 | |
| 13008 | Millennium Park | 5229 | ||
| November | 13022 | Streeter Dr & Grand Ave | 2746 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 1679 | ||
| December | 13022 | Streeter Dr & Grand Ave | 1683 | |
| 13008 | Millennium Park | 1267 | ||
| January | TA1307000039 | Clark St & Elm St | 216 | |
| KA1504000135 | Wells St & Elm St | 197 | ||
| February | 13022 | Streeter Dr & Grand Ave | 355 | |
| 13008 | Millennium Park | 250 | ||
| March | 13022 | Streeter Dr & Grand Ave | 3674 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 2055 | ||
| April | 13022 | Streeter Dr & Grand Ave | 5002 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 2480 | ||
| May | 13022 | Streeter Dr & Grand Ave | 12295 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 6312 | ||
| June | 13022 | Streeter Dr & Grand Ave | 15281 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 8984 | ||
| July | 13022 | Streeter Dr & Grand Ave | 18042 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 10006 | ||
| Member | August | LF-005 | DuSable Lake Shore Dr & North Blvd | 6629 |
| TA1308000050 | Wells St & Concord Ln | 6343 | ||
| September | TA1308000050 | Wells St & Concord Ln | 5839 | |
| TA1307000039 | Clark St & Elm St | 5381 | ||
| October | KA1503000014 | Ellis Ave & 60th St | 5975 | |
| TA1307000039 | Clark St & Elm St | 5155 | ||
| November | KA1503000014 | Ellis Ave & 60th St | 4208 | |
| KA1503000043 | Kingsbury St & Kinzie St | 4056 | ||
| December | KA1503000043 | Kingsbury St & Kinzie St | 2922 | |
| WL-012 | Clinton St & Washington Blvd | 2427 | ||
| January | KA1503000043 | Kingsbury St & Kinzie St | 1780 | |
| TA1305000032 | Clinton St & Madison St | 1382 | ||
| February | KA1503000071 | University Ave & 57th St | 2010 | |
| KA1503000014 | Ellis Ave & 60th St | 1903 | ||
| March | KA1503000043 | Kingsbury St & Kinzie St | 3280 | |
| WL-012 | Clinton St & Washington Blvd | 2788 | ||
| April | KA1503000043 | Kingsbury St & Kinzie St | 3745 | |
| WL-012 | Clinton St & Washington Blvd | 3316 | ||
| May | KA1503000071 | University Ave & 57th St | 4925 | |
| KA1503000014 | Ellis Ave & 60th St | 4873 | ||
| June | LF-005 | DuSable Lake Shore Dr & North Blvd | 5696 | |
| KA1503000043 | Kingsbury St & Kinzie St | 5395 | ||
| July | LF-005 | DuSable Lake Shore Dr & North Blvd | 5760 | |
| KA1503000043 | Kingsbury St & Kinzie St | 5333 |
Pivot - Member-Type & Day & most used Station#
df_temp = station_concat.groupby(['member_casual','day_of_ride','station_id','station_name'])['station_id'].count().\
rename('no_of_rides').sort_values(ascending=False)
dfs = []
for idx, temp_df in df_temp.groupby(level=[0,1]):
dfs.append(temp_df.nlargest(3))
most_common_station_weekday = pd.concat(dfs).to_frame()
most_common_station_weekday
| no_of_rides | ||||
|---|---|---|---|---|
| member_casual | day_of_ride | station_id | station_name | |
| Casual | Monday | 13022 | Streeter Dr & Grand Ave | 13005 |
| 13008 | Millennium Park | 6585 | ||
| LF-005 | DuSable Lake Shore Dr & North Blvd | 6156 | ||
| Tuesday | 13022 | Streeter Dr & Grand Ave | 8295 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 5593 | ||
| 13008 | Millennium Park | 4030 | ||
| Wednesday | 13022 | Streeter Dr & Grand Ave | 8850 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 5692 | ||
| 13042 | Michigan Ave & Oak St | 3858 | ||
| Thursday | 13022 | Streeter Dr & Grand Ave | 10072 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 6440 | ||
| 13008 | Millennium Park | 5010 | ||
| Friday | 13022 | Streeter Dr & Grand Ave | 12921 | |
| 13008 | Millennium Park | 6440 | ||
| 13300 | DuSable Lake Shore Dr & Monroe St | 5771 | ||
| Saturday | 13022 | Streeter Dr & Grand Ave | 25955 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 13351 | ||
| 13042 | Michigan Ave & Oak St | 11547 | ||
| Sunday | 13022 | Streeter Dr & Grand Ave | 22247 | |
| 13300 | DuSable Lake Shore Dr & Monroe St | 11497 | ||
| LF-005 | DuSable Lake Shore Dr & North Blvd | 10715 | ||
| Member | Monday | KA1503000043 | Kingsbury St & Kinzie St | 8045 |
| WL-012 | Clinton St & Washington Blvd | 6262 | ||
| TA1307000039 | Clark St & Elm St | 6184 | ||
| Tuesday | KA1503000043 | Kingsbury St & Kinzie St | 8751 | |
| WL-012 | Clinton St & Washington Blvd | 8329 | ||
| TA1305000032 | Clinton St & Madison St | 7170 | ||
| Wednesday | KA1503000043 | Kingsbury St & Kinzie St | 8332 | |
| WL-012 | Clinton St & Washington Blvd | 7960 | ||
| TA1305000032 | Clinton St & Madison St | 6871 | ||
| Thursday | KA1503000043 | Kingsbury St & Kinzie St | 7638 | |
| WL-012 | Clinton St & Washington Blvd | 7544 | ||
| TA1305000032 | Clinton St & Madison St | 6536 | ||
| Friday | KA1503000043 | Kingsbury St & Kinzie St | 6520 | |
| TA1308000050 | Wells St & Concord Ln | 5909 | ||
| TA1307000039 | Clark St & Elm St | 5781 | ||
| Saturday | TA1308000050 | Wells St & Concord Ln | 7331 | |
| 13179 | Clark St & Lincoln Ave | 6156 | ||
| TA1307000039 | Clark St & Elm St | 5863 | ||
| Sunday | TA1308000050 | Wells St & Concord Ln | 6157 | |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 6002 | ||
| TA1308000001 | Theater on the Lake | 5561 |
Pivot - Member-Type & Ride-Type & most used Station#
df_temp = station_concat.groupby(['member_casual','rideable_type','station_id','station_name'])['station_id'].count().\
rename('no_of_rides').sort_values(ascending=False)
dfs = []
for idx, temp_df in df_temp.groupby(level=[0,1]):
dfs.append(temp_df.nlargest(5))
most_common_station_ridetype = pd.concat(dfs).to_frame()
most_common_station_ridetype
| no_of_rides | ||||
|---|---|---|---|---|
| member_casual | rideable_type | station_id | station_name | |
| Casual | Classic Bike | 13022 | Streeter Dr & Grand Ave | 57559 |
| LF-005 | DuSable Lake Shore Dr & North Blvd | 34245 | ||
| 13042 | Michigan Ave & Oak St | 27200 | ||
| 13300 | DuSable Lake Shore Dr & Monroe St | 27005 | ||
| 13008 | Millennium Park | 24830 | ||
| Docked Bike | 13022 | Streeter Dr & Grand Ave | 23815 | |
| 13008 | Millennium Park | 11906 | ||
| 13300 | DuSable Lake Shore Dr & Monroe St | 10975 | ||
| 15544 | Shedd Aquarium | 9138 | ||
| 13042 | Michigan Ave & Oak St | 8026 | ||
| Electric Bike | 13022 | Streeter Dr & Grand Ave | 19971 | |
| 13008 | Millennium Park | 11085 | ||
| TA1308000050 | Wells St & Concord Ln | 10608 | ||
| 13300 | DuSable Lake Shore Dr & Monroe St | 10189 | ||
| 13042 | Michigan Ave & Oak St | 10001 | ||
| Member | Classic Bike | KA1503000043 | Kingsbury St & Kinzie St | 34308 |
| TA1307000039 | Clark St & Elm St | 31554 | ||
| TA1308000050 | Wells St & Concord Ln | 31106 | ||
| KA1503000014 | Ellis Ave & 60th St | 30064 | ||
| KA1503000071 | University Ave & 57th St | 29644 | ||
| Electric Bike | KA1503000043 | Kingsbury St & Kinzie St | 14792 | |
| WL-012 | Clinton St & Washington Blvd | 13432 | ||
| TA1308000050 | Wells St & Concord Ln | 12642 | ||
| TA1307000039 | Clark St & Elm St | 11946 | ||
| TA1305000032 | Clinton St & Madison St | 11636 |
EXPORT PIVOTS TO EXCEL#
Deep Dive - Visualization#
Let’s first take a look at the distribution of rides used by member and casual riders as well as what type of ride they use
Distribution of rides#
df_member_type_distri = df_cleaned_v3.groupby(['member_casual','rideable_type'])['ride_id'].count().rename('no_of_rides').reset_index()
df_member_type_distri['Total'] = " "
# df_member_type_distri['rideable_type'] = df_member_type_distri['rideable_type'].apply(lambda x: x.replace('_',' '))
# df_member_type_distri[['member_casual', 'rideable_type']] = df_member_type_distri[['member_casual', 'rideable_type']] \
# .astype(str).apply(lambda col: col.str.title())
fig = px.sunburst(df_member_type_distri, path=['Total','member_casual', 'rideable_type'],
values = 'no_of_rides',
maxdepth=3,
template="plotly_white",
color_discrete_sequence=['#E24A33', '#348ABD'],
height=900
)
fig.update_traces(textinfo= "label+percent entry")
fig.update_layout(title_text="<b>Distribution by Ride-Type and Member</b>",title_x=0.5,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
fig.write_image('./../pictures/sunburst_member_ridetype.png')
fig.show()
Both parties seem to have a preference for the class bike instead of an electric bike. This preference is strongest for the members who prefer to use the classic bike for 71.7% of their rides vs casual riders who use it for 58.5% of their rides.
Also casual riders are the only ones who use the option of the docked bike.
Ride distance#
Next we will look at a quick distribution of the distance classified into 6 categories from Short to X-Long (see annotations).
df_dist_class_distri = df_cleaned_v3.groupby('dist_class')['ride_id'].count().reset_index()
fig, ax = plt.subplots(subplot_kw=dict(aspect="equal"))
wedges, texts, autotexts = ax.pie(df_dist_class_distri['ride_id'], autopct='%1.1f%%', pctdistance=0.8, wedgeprops = {'width': 0.5, 'linewidth' : 1, 'edgecolor' : 'white' },
colors = sns.color_palette("Blues")[:len(df_dist_class_distri)], startangle= 0,
textprops={'fontsize':24, 'fontweight':'bold'}
)
ax.set_title('Distribution of distance travelled by classification',
fontsize = 36, weight = 'bold')
annotations = ["Short\n x < 500m",
"Short-Medium\n 500m <= x < 1.000m",
"Medium\n 1.000m <= x < 2.500m",
"Medium-Long\n 2.500m <= x < 5.000m",
"Long\n 5.000m <= x < 10.000m",
"X-Long\n 10.000m <= x"]
bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(arrowprops=dict(arrowstyle="-"),
bbox=bbox_props, zorder=0, va="center")
for i, p in enumerate(wedges):
ang = (p.theta2 - p.theta1)/2. + p.theta1
y = np.sin(np.deg2rad(ang))
x = np.cos(np.deg2rad(ang))
horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
connectionstyle = "angle,angleA=0,angleB={}".format(ang)
kw["arrowprops"].update({"connectionstyle": connectionstyle})
ax.annotate(annotations[i], xy=(x, y), xytext=(1.2*np.sign(x), 1.2*y),
horizontalalignment=horizontalalignment, **kw)
# del df_dist_class_distri
plt.savefig('./../pictures/donut_distclass.png', transparent = True)
plt.show()
But is there a difference between the two groups and their ride distance?
Histogram#
fig, ax = plt.subplots()
member = df_cleaned_v3.loc[df_cleaned_v3['member_casual']=='Member']['dist_ride']
casual = df_cleaned_v3.loc[df_cleaned_v3['member_casual']=='Casual']['dist_ride']
ax.hist([casual,member], bins =150, histtype='bar', stacked=True, label=['Casual', 'Member'], range=[0, 15000])
ax.set_title('Histogram of rides by distance', pad = 40)
ax.legend(title='Member type', loc='upper right')
ax.set_ylabel('# of rides', fontsize = 32, weight = 'bold')
ax.set_xlabel('Distance travelled [km]', fontsize = 32, weight = 'bold')
ax.xaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p:
f'{locale.format_string("%d", round(x/1000,2), grouping=True)}km'))
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p:
f'{locale.format_string("%d", x, grouping=True)}'))
fig.subplots_adjust(top=.95)
plt.savefig('./../pictures/hist_ridedistance.png', transparent = True)
plt.show()
The histogramm suggests that the casual riders ride slightly longer distances. Also it can be seen very well that the members use Cyclist more often.
But is there a difference in ride behaviour (number of rides, as well as distance) in the weekdays?
Violinplot for member-type and weekday#
fig, ax = plt.subplots()
# boxplot = df_cleaned_v3.boxplot(column=['dist_ride','ride_timedelta'], by='member_casual')
ax = sns.violinplot(x="day_of_ride", y="dist_ride", hue="member_casual", hue_order = ['Casual','Member'],
data=df_cleaned_v3, linewidth=2.5, whis=[1,99], scale='count',
order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday' ],
# height = 1200, widht= 1600
)
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: f'{locale.format_string("%d", round(x/1000,2), grouping=True)}km'))
ax.legend(fontsize=26, title='Type of Member')
ax.set_title('Distribution of distance travelled by Member-Type and Weekday',
fontsize = 36, weight = 'bold', pad = 40)
ax.set_ylabel('Distance travelled [m]', fontsize = 28, weight = 'bold')
ax.set_xlabel('Weekday', fontsize = 28, weight = 'bold')
medians_dist = df_cleaned_v3.groupby(['day_of_ride','member_casual'])['dist_ride'].median()
max_dist = df_cleaned_v3.groupby(['day_of_ride','member_casual'])['dist_ride'].max()
vertical_offset = 1.2 # offset from median for display
i = 0
for xtick1 in ax.get_xticks():
for n in range(2):
i+= n%2
iterator = xtick1 + i
x_offset = xtick1 - 0.2 + 0.4*n
t = ax.text(x_offset,medians_dist[iterator] * vertical_offset,f'{round(medians_dist[iterator]/1000,2)}km',
horizontalalignment='center',size='medium',color='black',weight='semibold')
t.set_bbox(dict(facecolor='white', alpha=0.5))
ax.text(x_offset,max_dist[iterator]*1.03, f'{round(max_dist[iterator]/1000,2)}km',
horizontalalignment='center',size='medium',color='firebrick',weight='semibold')
plt.savefig('./../pictures/violin_ridedist_weekday.png', transparent = True)
plt.show()
The violinplot of the ridedistance grouped by weekday and member type clearly shows that the number of rides for the casual members is significantly less in the workweek days (Mo-Fr) as shown by the slimmer widths of the violins. On weekends the casual riders even outweigh the members.
For both parties it also shows that the weekend comes with increased ride distances.
Next to the weekday is there also a difference when it comes to the months or the season?
Violinplot for member-type and month#
fig, ax = plt.subplots()
# boxplot = df_cleaned_v3.boxplot(column=['dist_ride','ride_timedelta'], by='member_casual')
ax = sns.violinplot(x="month_of_ride", y="dist_ride", hue="member_casual", hue_order = ['Casual','Member'],
data=df_cleaned_v3, linewidth=2.5, whis=[1,99], scale='count'
)
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: f'{locale.format_string("%d", round(x/1000,2), grouping=True)}km'))
ax.legend(fontsize=26, title='Type of Member')
ax.set_title('Distribution of distance travelled by Member-Type and Month', pad = 40,
fontsize = 36, weight = 'bold')
ax.set_ylabel('Distance travelled [m]', fontsize = 28, weight = 'bold')
ax.set_xlabel('Month', fontsize = 28, weight = 'bold')
plt.xticks(rotation = 45)
medians_dist = df_cleaned_v3.groupby(['month_of_ride','member_casual'])['dist_ride'].median()
max_dist = df_cleaned_v3.groupby(['month_of_ride','member_casual'])['dist_ride'].max()
# print(medians)
vertical_offset = 1.2 # offset from median for display
i = 0
for xtick1 in ax.get_xticks():
for n in range(2):
i+= n%2
iterator = xtick1 + i
x_offset = xtick1 - 0.2 + 0.4*n
t = ax.text(x_offset,medians_dist[iterator] * vertical_offset,f'{round(medians_dist[iterator]/1000,2)}km',
horizontalalignment='center',size='medium',color='black',weight='semibold')
t.set_bbox(dict(facecolor='white', alpha=0.5))
ax.text(x_offset,max_dist[iterator]*1.05, f'{round(max_dist[iterator]/1000,2)}km',
horizontalalignment='center',size='medium',color='firebrick',weight='semibold')
plt.savefig('./../pictures/violin_ridedist_month.png', transparent = True)
plt.show()
The above violinplot is similar to the one before with the only diffeerence that the weekdays are substituted by the months. Here we also see a drastic reduction in rides (width of violin) for the cold months starting from October and stretching to March. January and February are especially low volume for the casual riders whereas the members still ride almost as much as during the summers.
Also for both parties the distance travelled is shorter during the winter months.
Ride length [t]#
Next let’s look at the difference in riding behaviour for the length of the ride.
#### Boxplot - Ride-length by Member-Type and Weekday
ax = sns.boxplot(x="day_of_ride", y="length_of_ride_s", hue="member_casual", hue_order = ['Casual','Member'],
data=df_cleaned_v3, linewidth=2.5, whis=[1,99]
)
ax.legend(fontsize=26, title='Type of Member')
ax.set_title('Distribution of length of ride by Member-Type and Weekday',
fontsize = 36, weight = 'bold', pad = 40)
ax.set_ylabel('time travlled [t]', fontsize = 28, weight = 'bold')
ax.set_yscale('log')
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: pd.to_timedelta(x, unit='s')))
ax.set_xlabel('Weekday', fontsize = 28, weight = 'bold')
medians_time = df_cleaned_v3.groupby(['day_of_ride','member_casual'])['length_of_ride_s'].median()
max_time = df_cleaned_v3.groupby(['day_of_ride','member_casual'])['length_of_ride_s'].max()
top_99 = df_cleaned_v3.groupby(['day_of_ride','member_casual'])['length_of_ride_s'].quantile(q=0.99, interpolation='nearest')
vertical_offset = 1.2 # offset from median for display
i = 0
for xtick1 in ax.get_xticks():
for n in range(2):
i+= n%2
iterator = xtick1 + i
x_offset = xtick1 - 0.2 + 0.4*n
t = ax.text(x_offset,medians_time[iterator] * vertical_offset, strfdelta(datetime.timedelta(seconds=medians_time[iterator]),'%H:%M'),
horizontalalignment='center',size='medium',color='black',weight='semibold')
t.set_bbox(dict(facecolor='white', alpha=0.5))
ax.text(x_offset,max_time[iterator]*1.15, datetime.timedelta(seconds=max_time[iterator]),
horizontalalignment='center',size='medium',color='firebrick',weight='semibold')
t2 = ax.text(x_offset,top_99[iterator]*1.15, datetime.timedelta(seconds=top_99[iterator]),
horizontalalignment='center',size='medium',color='firebrick',weight='semibold')
t2.set_bbox(dict(facecolor='white', alpha=0.9))
medians_time = medians_time.reset_index()
max_time = max_time.reset_index()
top_99 = top_99.reset_index()
plt.savefig('./../pictures/boxplot_ridelength_day.png' , transparent = True)
plt.show()
The above boxplot shows that the casual riders spent a significant time longer on the bike - by 54.4% to be exact (Casual: 00:14:25 vs Member: 00:09:20).
In the case of the ride length for Top 1% of rides (99th Percentile) this is even more drastic: Casual: 02:22:22 vs Member: 00:48:45.
So far this shows a clear difference in riding behaviour during the weekend, but is there also more differences throughout the year?
Boxplot - Ride-length by Member-Type and Month#
ax = sns.boxplot(x="month_of_ride", y="length_of_ride_s", hue="member_casual", hue_order = ['Casual','Member'],
data=df_cleaned_v3, linewidth=2.5, whis=[1,99]
)
ax.legend(fontsize=26, title='Type of Member')
ax.set_title('Distribution of length of ride by Member-Type and Month', pad = 40,
fontsize = 36, weight = 'bold')
ax.set_ylabel('time travlled [t]', fontsize = 28, weight = 'bold')
ax.set_yscale('log')
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: pd.to_timedelta(x, unit='s')))
ax.set_xlabel('Month', fontsize = 28, weight = 'bold')
plt.xticks(rotation = 45)
medians_time = df_cleaned_v3.groupby(['month_of_ride','member_casual'])['length_of_ride_s'].median()
max_time = df_cleaned_v3.groupby(['month_of_ride','member_casual'])['length_of_ride_s'].max()
top_99 = df_cleaned_v3.groupby(['month_of_ride','member_casual'])['length_of_ride_s'].quantile(q=0.99, interpolation='nearest')
vertical_offset = 1.2 # offset from median for display
i = 0
for xtick1 in ax.get_xticks():
for n in range(2):
i+= n%2
iterator = xtick1 + i
x_offset = xtick1 - 0.2 + 0.4*n
t = ax.text(x_offset,medians_time[iterator] * vertical_offset, strfdelta(datetime.timedelta(seconds=medians_time[iterator]),'%H:%M'),
horizontalalignment='center',size='medium',color='black',weight='semibold')
t.set_bbox(dict(facecolor='white', alpha=0.5))
ax.text(x_offset,max_time[iterator]*1.15, datetime.timedelta(seconds=max_time[iterator]),
horizontalalignment='center',size='medium',color='firebrick',weight='semibold')
t2 = ax.text(x_offset,top_99[iterator]*1.15, datetime.timedelta(seconds=top_99[iterator]),
horizontalalignment='center',size='medium',color='firebrick',weight='semibold')
t2.set_bbox(dict(facecolor='white', alpha=0.9))
medians_time = medians_time.reset_index()
max_time = max_time.reset_index()
top_99 = top_99.reset_index()
plt.savefig('./../pictures/boxplot_ridelength_month.png', transparent = True)
plt.show()
The general difference in the ride length persisted throughout the year comparing the two groups. There is howevere a synchronous change in ride length throughout the year for both. During the winter months (Oct - Mar) the length of the ride drops by a significant ammount (ca. 30%) for both. The drop is slightly larger for the casual riders.
This seems to be innline with the previous findings that also the distance travelled is reduced in those months - however not by 30%. During the winter months the riders seem to hurry more to get to their destination. Whereas in the summer the motto seems to be "the ride is the goal" in Winter it shifts to "the destination is the goal".
Relation of distance and ride length#
Now let’s take a look at the relationship between the distance of the ride and the length. By now we know that the distance travelled is slightly longer for casual riders, but the time needed for this more than the distance would suggest.
If there is a correlation between the ride distance and length of the ride, I expect it to be stronger for the members who travel almost as much as the casual riders in distance, but do it in a faster manner.
Also the docked bike has some gigantic outliers with the ride length (> 20 days). To get a good grasp lets take a look at the relationship of distance and ride length grouped by member-type and ride-type.
Linear Model of rides#
import scipy as sp
# annotate lmplots with pearson r and p-value to understand the correlation and statistical significance
# used from https://stackoverflow.com/questions/25579227/seaborn-lmplot-with-equation-and-r2-text
def annotate(data, **kws):
r, p = sp.stats.pearsonr(data['length_of_ride_s'], data['dist_ride'])
ax = plt.gca()
ax.text(.05, .8, 'r={:.2f}, p={:.2g}'.format(r, p), size = 14,
transform=ax.transAxes)
df_implot_temp = df_cleaned_v3[['member_casual','rideable_type', 'length_of_ride_s', 'dist_ride']]
df_implot_temp['rideable_type'] = df_implot_temp['rideable_type'].apply(lambda x: x.replace('_',' '))
sns.set(rc={'figure.figsize':(64,32),
'xtick.labelsize':16,
'ytick.labelsize':16})
g = sns.lmplot(data = df_implot_temp, x = 'length_of_ride_s', y = 'dist_ride', row = 'member_casual',
ci = 99, col ='rideable_type', scatter_kws = {'alpha' : 0.1}, hue = 'member_casual', hue_order = ['Casual', 'Member'],
facet_kws = {'sharex':True, 'sharey':True, 'legend_out':True, 'margin_titles':True},
col_order = ['Classic Bike', 'Electric Bike', 'Docked Bike'], row_order= ['Casual', 'Member'],
truncate=False
)
g.set(xlim=(0, 24*60**2))
g.set(ylim=(0, 30*10**3))
plt.subplots_adjust(hspace=0.35)
g.set_axis_labels("length of ride [t]", "distance travelled [km]", fontsize = 20, weight = 'bold')
for ax in g.axes.flat:
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: f'{round(x/1000,1)}km'))
ax.xaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: strfdelta(pd.to_timedelta(x, unit='s'), '%H:%M')))
for label in ax.get_xticklabels():
label.set_rotation(45)
g.map_dataframe(annotate)
g.fig.suptitle('Relation of ride length vs distance travelled',
fontsize = 28, weight = 'bold')
g.set_titles(col_template="{col_name}", row_template="{row_name}", size = 20, weight = 'bold')
g.figure.subplots_adjust(top=.9)
plt.savefig('./../pictures/lmplot_length_vs_dist.png', transparent = True)
plt.show()
C:\Users\kemke\AppData\Local\Temp\ipykernel_9028\1318206295.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# DF of classic bike
df_implot_temp = df_cleaned_v3[['member_casual','rideable_type', 'month_of_ride','length_of_ride_s', 'dist_ride']]\
.loc[(df_cleaned_v3['rideable_type']=="Classic Bike")]
sns.set(rc={'figure.figsize':(64,32),
'xtick.labelsize':16,
'ytick.labelsize':16})
g = sns.lmplot(data = df_implot_temp, x = 'length_of_ride_s', y = 'dist_ride', ci = 99, col ='member_casual', col_order = ['Casual', 'Member'],
row = 'month_of_ride', scatter_kws = {'alpha' : 0.1}, hue = 'member_casual', hue_order = ['Casual', 'Member'], aspect = 1.5,
facet_kws = {'sharex':True, 'sharey':True, 'legend_out':True, 'margin_titles':True},
# col_order = ['Classic Bike', 'Electric Bike', 'Docked Bike'], row_order= ['Casual', 'Member'],
truncate=False
)
g.set(xlim=(0, 8*60**2)) # limit to 8 hours day
g.set(ylim=(0, 30*10**3)) # limit to 30km
plt.subplots_adjust(hspace=0.25)
g.set_axis_labels("length of ride [t]", "distance travelled [km]", fontsize = 20, weight = 'bold')
for ax in g.axes.flat:
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: f'{round(x/1000,1)}km'))
ax.xaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: strfdelta(pd.to_timedelta(x, unit='s'), '%H:%M')))
for label in ax.get_xticklabels():
label.set_rotation(45)
g.map_dataframe(annotate)
g.set_titles(col_template="{col_name}", row_template="{row_name}", size = 20, weight = 'bold')
g.fig.suptitle("""Relation of ride length vs distance travelled
Classic Bike""",
fontsize = 28, weight = 'bold')
g.figure.subplots_adjust(top=.96)
plt.savefig('./../pictures/lmplot_length_vs_dist_classic_bike.png', dpi = 300, transparent = True)
plt.show()
# DF of electric bike
df_implot_temp = df_cleaned_v3[['member_casual','rideable_type', 'month_of_ride','length_of_ride_s', 'dist_ride']]\
.loc[(df_cleaned_v3['rideable_type']=="Electric Bike")]
sns.set(rc={'figure.figsize':(64,32),
'xtick.labelsize':16,
'ytick.labelsize':16})
g = sns.lmplot(data = df_implot_temp, x = 'length_of_ride_s', y = 'dist_ride', ci = 99, col ='member_casual', col_order = ['Casual', 'Member'],
row = 'month_of_ride', scatter_kws = {'alpha' : 0.1}, hue = 'member_casual', hue_order = ['Casual', 'Member'], aspect = 1.5,
facet_kws = {'sharex':True, 'sharey':True, 'legend_out':True, 'margin_titles':True},
# col_order = ['Classic Bike', 'Electric Bike', 'Docked Bike'], row_order= ['Casual', 'Member'],
truncate=False
)
g.set(xlim=(0, 8*60**2)) # limit to 8 hours
g.set(ylim=(0, 30*10**3))
plt.subplots_adjust(hspace=0.25)
g.set_axis_labels("length of ride [t]", "distance travelled [km]", fontsize = 20, weight = 'bold')
for ax in g.axes.flat:
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: f'{round(x/1000,1)}km'))
ax.xaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: strfdelta(pd.to_timedelta(x, unit='s'), '%H:%M')))
for label in ax.get_xticklabels():
label.set_rotation(45)
g.map_dataframe(annotate)
g.set_titles(col_template="{col_name}", row_template="{row_name}", size = 20, weight = 'bold')
g.fig.suptitle("""Relation of ride length vs distance travelled
Electric Bike""",
fontsize = 28, weight = 'bold')
g.figure.subplots_adjust(top=.96)
plt.savefig('./../pictures/lmplot_length_vs_dist_electric_bike.png', dpi = 300, transparent = True)
plt.show()
Stations and routes used#
Now let’s take a look if there is a geographical difference as well. Let’s look at the most used stations for this and then plot it into the map.
Barchart - Visited Stations#
try:
most_common_station.reset_index(inplace=True)
except:
pass
g = sns.FacetGrid(most_common_station, row="member_casual",hue="member_casual", size=8, aspect=2, sharex=False)
g.map(sns.barplot,'station_name','no_of_rides')
g.fig.suptitle('Most Used Station by Member-Type',
fontsize = 32, weight = 'bold')
g.fig.subplots_adjust(top=0.9)
g.set_axis_labels("Station", "# of Rides", fontsize=26, weight='bold')
g.set_titles(row_template="{row_name}", size=26)
plt.subplots_adjust(hspace=0.6)
for ax in g.axes.flat:
ax.yaxis.set_major_formatter(mat_tick.FuncFormatter(lambda x, p: locale.format_string('%d', x , grouping=True)))
for label in ax.get_xticklabels():
label.set_rotation(45)
plt.savefig('./../pictures/most_station.png')
plt.show()
c:\Users\kemke\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:337: UserWarning:
The `size` parameter has been renamed to `height`; please update your code.
c:\Users\kemke\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:670: UserWarning:
Using the barplot function without specifying `order` is likely to produce an incorrect plot.
Map Scatterplot - Visited Stations#
Now let’s visualize the difference in used stations and plot it onto the map. The size of the marker will indicate how frequently the station is visited by each member-type.
# Create DF for the map scatterplot
scatter_map_df = df_cleaned_v3[['member_casual','start_station_id','end_station_id']].melt(id_vars=['member_casual'], var_name='station_type', value_name='station_id')\
.groupby(['member_casual','station_id']).apply(lambda df: pd.Series({
'no_of_rides': df['station_type'].count()
})).reset_index()
mean_lat_lng = df_cleaned_v3.groupby(['start_station_id','start_station_name'])\
.apply(lambda df: pd.Series({
'lat': df['start_lat'].mean(),
'lng': df['start_lng'].mean(),
})).reset_index().rename(columns={'start_station_id':'station_id', 'start_station_name': 'station_name'})
scatter_map_df = scatter_map_df.merge(mean_lat_lng, how = 'left', on = 'station_id').dropna(how='any')
mapbox_acces_token = px.set_mapbox_access_token(open("../mapbox_token_public.json").read())
fig = px.scatter_mapbox(scatter_map_df,
lat=scatter_map_df['lat'],
lon=scatter_map_df['lng'],
size='no_of_rides',
color = 'member_casual',
hover_name="station_name",
hover_data = ['no_of_rides'],
opacity = 0.8,
zoom = 12,
labels = {'member_casual': 'Member Type'},
height = 1200,
width = 1200)
fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0}, # remove the white gutter between the frame and map
# hover appearance
hoverlabel=dict(
bgcolor="white", # white background
font_size=16,), # label font size
legend=dict(
yanchor="top",
y=0.98,
xanchor="right",
x=0.98,
),
title = dict(
x = 0.5,
xanchor = 'center',
text = '<b>Bike Stations Used by Member-Type and Frequency</b>',
font = dict(
size = 32,
)
)
)
fig.write_image('./../pictures/station_map.png')
fig.show()
Map Scatter- and Lineplot - Visited Stations and used routes#
Now let’s visualize the difference in used stations and the typical routes taken by the riders. For this however the number of combinations for the routes (station to station) will be too big and the plot would end up quite confusing.
We will limit the lines plotted to visualize the routes taken therefore to the top 200 routes for each member type.
# Create DF for the the different routes taken
scatter_map_rides_df = df_cleaned_v3.groupby(['member_casual','start_station_id','end_station_id'])['ride_id'].count()\
.rename('no_of_rides').reset_index()
scatter_map_rides_df =scatter_map_rides_df.loc[(scatter_map_rides_df['no_of_rides']!=0)]
print(scatter_map_rides_df.shape)
# Calculate the mean for Lat + Long for each station (otherwise one station would have several groups)
mean_lat_lng_start = df_cleaned_v3.groupby(['start_station_id','start_station_name'])\
.apply(lambda df: pd.Series({
'lat': df['start_lat'].mean(),
'lng': df['start_lng'].mean(),
'no_of_rides': df['ride_id'].count(),
})).reset_index()\
.rename(columns={'start_station_id':'station_id', 'start_station_name':'station_name'})
mean_lat_lng_end = df_cleaned_v3.groupby(['end_station_id','end_station_name'])\
.apply(lambda df: pd.Series({
'lat': df['end_lat'].mean(),
'lng': df['end_lng'].mean(),
'no_of_rides': df['ride_id'].count(),
})).reset_index()\
.rename(columns={'end_station_id':'station_id', 'end_station_name':'station_name'})
scatter_map_rides_df = scatter_map_rides_df.merge(mean_lat_lng_start, how = 'left',
left_on='start_station_id', right_on='station_id', suffixes=(None,'_start'))\
.merge(mean_lat_lng_end, how = 'left', left_on='end_station_id', right_on='station_id', suffixes=(None,'_end'))
scatter_map_rides_df.drop(columns=['station_id', 'station_id_end','no_of_rides_start', 'no_of_rides_end'], inplace=True)
scatter_map_rides_df.rename(columns={'lat':'lat_start', 'lng':'lng_start'}, inplace=True)
scatter_map_rides_df['line_width'] = round(scatter_map_rides_df['no_of_rides'] / scatter_map_rides_df['no_of_rides'].max() * 10,2)
scatter_map_rides_df['line_width'] = scatter_map_rides_df['line_width'].apply(lambda x: 1 if x < 1 else x)
scatter_map_rides_df.dropna(inplace=True)
(259344, 4)
# Create DF to plot stations - this time unified (no start + end separation or member-type separation)
stations_df = pd.concat([mean_lat_lng_start, mean_lat_lng_end])
stations_df = stations_df.groupby(['station_id', 'station_name']).apply(lambda df: pd.Series({
'lat': df['lat'].mean(),
'lng': df['lng'].mean(),
'no_of_rides': df['no_of_rides'].sum(),
})).reset_index()
mapbox_acces_token = px.set_mapbox_access_token(open("../mapbox_token_public.json").read())
# mapbox_acces_token = px.set_mapbox_access_token(open("mapbox_token_public.json").read())
fig = px.scatter_mapbox(data_frame=stations_df,
lat='lat',
lon='lng',
size= 'no_of_rides',
color_discrete_sequence = ['rgba(31,31,31, 0.9)'],
hover_name= 'station_name',
hover_data = ['no_of_rides'],
opacity = 0.8,
zoom = 12,)
for idx, temp_df in scatter_map_rides_df.groupby('member_casual'):
color = color_dict.get(idx, 'rgba(31, 31, 31, 1)').replace(", 1)", ", 0.5)")
temp_df.sort_values(by='no_of_rides', ascending=False, inplace=True)
temp_df = temp_df[:200].reset_index()
for i in range(len(temp_df)):
lat = [temp_df['lat_start'][i], temp_df['lat_end'][i]]
lon = [temp_df['lng_start'][i], temp_df['lng_end'][i]]
fig.add_trace(ptly_go.Scattermapbox(
mode = "lines",
lat = lat,
lon = lon,
name = idx,
hovertext = f"{temp_df['station_name'][i]} - {temp_df['station_name_end'][i]} \n # Rides: {temp_df['no_of_rides'][i]} ",
# hoverinfo = ['no_of_rides'],
line = dict(
color= color,
width = temp_df['line_width'][i]*2, # multiply by 2 because original value not big enoug
),
))
fig.update_layout(margin ={'l':0,'t':45,'b':0,'r':0},
mapbox = {
'zoom': 11},
width=1200,
height=1200,
legend=dict(
yanchor="top",
y=0.98,
xanchor="right",
x=0.98
),
title = dict(
x = 0.5,
xanchor = 'center',
text = '<b>Top 200 Routes by Member-Type</b>',
font = dict(
size = 32,
)
)
)
fig.update_traces(showlegend=False)
fig.write_image('./../pictures/routes_map.png')
fig.show()
The plot shows a clear concentration of the casual riders near the lake side and in or between the several parks.
For the members there are several little hubs or networks. One which is near downtown and a big one near the university.
Conclusion#
For the members the Cyclist bike seems to be more integrated into their daily life and used for work or similar things. They ride their bike throughout the year with little change and are not wasting time when they are riding.
In contrast the casual riders show a clear tendence towards riding as a leisure activity. They ride their bikes mostly on the weekens and during the hotter months, especially in summer. While they ride mostly in the parks they seem to stroll around which is why there is also a weaker correlation between the ride time and the distance of the ride. When winter hits, this however changes and the cold temperatures seem to shift the behaviour. For the few remaining casual riders the bike becomes more of a vehicle to get from point A to B instead of a long stroll through the new cihlly cold park.